Abstract

Meta-analysis of genome-wide association studies is increasingly popular and many
meta-analytic methods have been recently proposed. A majority of meta-analytic methods combine
information from multiple studies by assuming that studies are independent, since individuals
collected in one study are unlikely to be collected again by another study. However, it has
become increasingly common to utilize the same control individuals among multiple studies
to reduce genotyping or sequencing cost. This causes those studies that share the same
individuals to be dependent, and spurious associations may arise if overlapping subjects are
not taken into account in a meta-analysis. In this paper, we propose a general framework for
meta-analyzing dependent studies with overlapping subjects. Given dependent studies, our
approach "decouples" the studies into independent studies such that meta-analysis methods
assuming independent studies can be applied. This enables many meta-analysis methods,
such as the random effects model, to account for overlapping subjects. Another advantage is
that one can continue to use preferred software in the analysis pipeline which may not support
overlapping subjects. Using simulations and the Wellcome Trust Case Control Consortium
data, we show that our decoupling approach allows both the fixed and the random effects
models to account for overlapping subjects while retaining desirable false positive rate and
power.

1 Introduction

Meta-analysis of genome-wide association studies is becoming increasingly popular [1][2].
Investigators combine multiple studies into a single meta-analysis to increase sample size and
thereby the statistical power to detect causal variants. In the past two to three years,
large-scale meta-analyses have been highly successful, dramatically increasing the number of
known associated loci in many human diseases [3][4].

There exist a number of categories of methods in meta-analysis. The fixed effects model (FE)
assumes that the effect sizes are fixed across the studies and is powerful when the assumption
holds [5][6]. The random effects model (RE) assumes that the effect sizes can be different between
studies, a phenomenon called heterogeneity [7]. A recently proposed random effects model is
shown to be more powerful than FE if the data are heterogeneous [8]. Additional categories
of methods include the p-value-based approaches [9][10], the subset approaches assuming that
the effects are present or absent in the studies [11][12], and the Bayesian approaches [13]-[15]. A
majority of these methods combine information from multiple studies by assuming that studies
are independent, since individuals collected in one study are unlikely to be collected again by
another study.

However, it has become increasingly common to utilize the same control individuals among
multiple studies to reduce genotyping or sequencing cost [16]. This causes those studies that
share the same individuals to be dependent, and spurious associations may arise if overlapping
subjects are not taken into account in a meta-analysis. A naive solution would be to manually
split the overlapping subjects into distinct studies, which can be sub-optimal [17] and may not
be practical if genotype data are not shared.

Recently, Lin and Sullivan proposed a meta-analytic approach that takes into account
overlapping subjects [17]. This approach provides an optimal and elegant solution to account for
the correlation structure between studies caused by the overlapping subjects. However, their
method is exclusively based on the fixed effects model. Recent studies extended this method
to the p-value-based approach [10] and the subset approach [11], but to date, it is unclear
how to account for overlapping subjects in the random effects model and other meta-analytic
approaches.

In this paper, we propose a general framework for meta-analyzing dependent studies with
overlapping subjects, which is an extension of the Lin and Sullivan approach [17]. Given the
correlation structure between studies, the core idea is to uncorrelate or decouple the studies
into independent studies such that meta-analysis methods assuming independent studies can be
applied. The advantage of our decoupling approach is that it enables many meta-analysis
methods, such as the random effects model, to account for overlapping subjects. Since our approach
involves only a data-side change rather than a method-side change, one can continue to use
preferred software in the existing analysis pipeline which may not support overlapping subjects.
We analytically show that the Lin and Sullivan approach is one special case in our framework in
which the fixed effects model is applied after decoupling studies. We demonstrate the utility of
our approach by performing a meta-analysis of three autoimmune diseases using the Wellcome
Trust Case Control Consortium data [16].

2 Results 

2.1 Overview of the Method 

Many traditional as well as recently proposed meta-analytic methods exist, but a majority of
them cannot account for overlapping subjects (Table 1). Moreover, even if there exists a solution
for overlapping subjects in a specific method such as the fixed effects model, the software one
prefers may not support overlapping subjects. For example, the widely used software METAL [18]
and MANTEL [2] do not support overlapping subjects. We propose a general framework that
allows different approaches to deal with overlapping subjects, which extends the Lin and Sullivan
approach [17]. Throughout this paper, we will often use the term "correlation of studies" as
shorthand for the correlation of the statistics (typically z-scores) of the studies. The intuition is
that the more the studies are correlated, the less information they contain with respect to the
summary statistic of the meta-analysis. For example, if two studies are perfectly correlated
($r = 1$), their combined information is no better than a single study's information. In Figure 1,
we have three studies A, B, and C whose statistics are correlated. For simplicity, their variances
are set to 1.0. Our approach "decouples" the studies into independent studies that have the same
information with respect to the summary statistic. The penalty for the decoupling is the increased
variances. The variance of study B increases the most (to 2.52), because its correlations to A
and C are large (0.5 and 0.3). The size of the circles denotes the amount of information in
terms of the inverse variance, showing that B carries the least information. We can then use
the decoupled studies in the downstream meta-analysis method. If the downstream method is
the fixed effects model, our decoupling approach is equivalent to the optimal method of Lin and
Sullivan [17] (see Methods).
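
As a rough numerical illustration of the three-study example, the following R sketch applies the
decoupling step described in Methods to unit-variance studies A, B, and C. The A-B and B-C
correlations (0.5 and 0.3) are taken from the text. The A-C correlation is not stated in the text,
so the value 0.1 used here is an assumption chosen only for illustration, and the helper name
`decouple_variances` is hypothetical.

```r
# Decoupled variances: the reciprocals of the row sums of Omega^{-1}
# (steps 3 and 4 of the decoupling procedure in Methods).
decouple_variances <- function(Omega) {
  1 / rowSums(solve(Omega))
}

# Covariance matrix of studies A, B, C with unit variances.
# r_AB = 0.5 and r_BC = 0.3 come from the text;
# r_AC = 0.1 is an ASSUMED value used only for this illustration.
Omega <- matrix(c(1.0, 0.5, 0.1,
                  0.5, 1.0, 0.3,
                  0.1, 0.3, 1.0),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

v_decoupled <- decouple_variances(Omega)
print(round(v_decoupled, 2))
```

Under this assumed A-C correlation, study B receives the largest decoupled variance (about 2.52),
so it contributes the least information to the downstream meta-analysis.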



2.2 False positive rate and power simulations 

We perform simulations to examine the false positive rate and power of our decoupling approach.
We suppose that ten different studies are combined in a meta-analysis to test a genotyped marker.
We assume that the studies are uniform in their sample sizes and marker allele frequencies, and
that the sample sizes are sufficiently large. Under these assumptions, simulating genotype data
is approximately equivalent to simulating the observed effect sizes directly from a normal
distribution. For simplicity, we assume that the variances of the effect sizes are uniformly 1.0.
We use the significance threshold $\alpha = 0.05$ for all simulations below.

We first simulate the null model in which the marker is not associated with the disease in any
study. We assume that the correlation $r_{ij}$ between studies $i$ and $j$ is uniform for all study
pairs $i \neq j$. This defines our covariance matrix $\Omega$. Then we sample the vector of observed
effect sizes $x$ from $N(0, \Omega)$ 10,000 times. We vary $r_{ij}$ from 0.0 to 0.9 and measure the
false positive rates for different meta-analytic approaches.

In Figure 2A, we compare the false positive rate of the methods for the fixed effects model.
The naive FE method refers to the traditional fixed effects model, which does not account for the
correlations. The naive method shows a dramatically inflated false positive rate, as expected,
since the correlations are ignored. The inflation worsens as the unaccounted correlation
increases, reaching 0.52 at $r_{ij} = 0.9$. The decoupling FE refers to our decoupling approach
applying FE after decoupling the studies. Both the decoupling FE and the Lin-Sullivan
approach correctly control the false positive rate. The two methods yield identical results,
since they are equivalent (see Methods). The average false positive rate of the two methods
over all correlation values $r_{ij}$ was identically 0.050.
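
The null simulation for the fixed effects model can be sketched in R as follows. This is a minimal
illustration under the stated assumptions (ten studies, unit variances, uniform correlation), not
the code used to produce Figure 2. The seed, the variable names, and the helper `fe_pvalue` are
hypothetical choices.

```r
set.seed(1)  # arbitrary seed so that the sketch is reproducible

n_studies <- 10      # ten studies, as in the text
n_sim     <- 10000   # number of null replicates
r         <- 0.9     # uniform pairwise correlation r_ij
alpha     <- 0.05    # significance threshold

# Covariance matrix with unit variances and uniform correlation r.
Omega <- matrix(r, n_studies, n_studies)
diag(Omega) <- 1

# Draw x ~ N(0, Omega); each row of X is one simulated meta-analysis.
Z <- matrix(rnorm(n_sim * n_studies), n_sim, n_studies)
X <- Z %*% chol(Omega)

# Fixed effects two-sided p-values given per-study variances v.
fe_pvalue <- function(X, v) {
  w <- 1 / v
  z <- (X %*% w) / sqrt(sum(w))
  2 * pnorm(-abs(z))
}

# Naive FE ignores the correlations and uses the original variances.
p_naive <- fe_pvalue(X, rep(1, n_studies))

# Decoupling FE uses the variances updated by the decoupling step.
v_decoupled <- 1 / rowSums(solve(Omega))
p_decoupled <- fe_pvalue(X, v_decoupled)

fpr_naive     <- mean(p_naive < alpha)
fpr_decoupled <- mean(p_decoupled < alpha)
print(c(naive = fpr_naive, decoupled = fpr_decoupled))
```

With $r_{ij} = 0.9$, the naive rate should land near the inflated value reported above, while the
decoupled rate should stay close to the nominal 0.05, up to simulation noise.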

In Figure 2B, we assess the false positive rate of the methods for the random effects model.
The naive RE method refers to the Han and Eskin random effects model [8] without accounting for
the correlation. The naive method shows a dramatically inflated false positive rate that increases
with the correlation. The decoupling RE refers to our decoupling approach applying the Han
and Eskin random effects model [8] after decoupling the studies. The decoupling RE correctly
controls the false positive rate, with some conservative tendencies. The average false positive
rate of the decoupling RE over all correlation values $r_{ij}$ was 0.034.

In Figure 2C, we simulate the alternative model assuming that the fixed effects model is the
generative model. We fix the correlation to be $r_{ij} = 0.20$. We sample $x$ from $N(\beta e, \Omega)$,
where we vary the mean effect $\beta$ from 0.0 to 2.0. The power of the decoupling FE increases as
the mean effect increases. The power is identical to that of the Lin-Sullivan approach, since the
two methods are equivalent. Note that the naive FE method is not shown in the power comparison
since its false positive rate is not properly controlled.

In Figure 2D, we simulate the alternative model assuming that the random effects model
is the generative model. Again, we fix the correlation to be $r_{ij} = 0.20$. We sample $x$ from
$N(\beta e, \Omega + \tau^2 I)$, where we vary both $\beta$ and the heterogeneity $\tau^2$. The power of
the decoupling RE is shown for different configurations of the models. We find that the power
shows typical characteristics of the random effects model: the power increases as the mean effect
increases and as the heterogeneity increases [8]. This shows that when we direct decoupled studies
into the random effects model, the method has power to detect the alternative models that the
random effects model is designed for.

In summary, our simulations highlight three points: (1) The decoupling approach can be flexibly
applied to both the fixed and the random effects models. (2) When applied to the fixed effects
model, the decoupling approach gives results equivalent to the Lin-Sullivan approach. The
method accurately controls the false positive rate and shows power to detect the alternative
model. (3) When applied to the random effects model, the decoupling approach controls the
false positive rate with conservative tendencies, while retaining the power to detect alternative
models of the random effects model.

2.3 Applications to the Wellcome Trust Case Control Consortium data 

We apply our decoupling approach to the Wellcome Trust Case Control Consortium (WTCCC)
data. The WTCCC has performed genome-wide association studies of seven diseases (bipolar
disorder, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, type 1
diabetes, and type 2 diabetes, or BD, CAD, CD, HT, RA, T1D, and T2D for short). Using these
data, we perform a meta-analysis of three autoimmune diseases (CD, RA, and T1D). These data
sets are a good example of overlapping subjects because all of the controls are shared between
diseases. The WTCCC performed a combined analysis using the genotype data of these three
diseases and reported four significant loci (see Supplementary Table 11 of Burton et al. [16]). We
want to show the utility of our approach by reproducing the same results using only summary
statistics, without genotype data.

We first calculate the correlation matrix between the seven diseases using the Lin and Sullivan
formula (see equation (4) in Methods). Figure 3 shows that the studies are positively correlated
due to the shared control design. The correlations are around $r = 0.4$ for all pairs of
diseases. These uniform correlations reflect the unique study design in which all controls are
shared and similar numbers of cases are collected for all diseases.

We perform the meta-analysis of three autoimmune diseases (CD, RA, and T1D) using the
log odds ratios and their standard errors. We consider 397,450 SNPs that passed quality control
for all three diseases and have a minor allele frequency greater than 1%. We first apply the naive
fixed effects model (FE) and the random effects model (RE), which do not take into account the
correlation structure. Figure 4A shows that the qq-plot is highly inflated for both the naive FE
and the naive RE (the genomic control factors [19], $\lambda_{FE} = 1.86$ and $\lambda_{RE} = 1.62$,
excluding the MHC region). Since the p-values are highly inflated, further downstream analyses
using these naive approaches can be susceptible to false positives.
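
For readers who want to check calibration on their own summary statistics, a genomic control
factor can be computed from p-values as sketched below in R. The sketch assumes the usual
definition (median of the 1-df chi-square statistics divided by the median of the chi-square
distribution with 1 degree of freedom). The function name and the simulated p-values are
hypothetical and are not the WTCCC data.

```r
# Genomic control factor: median observed 1-df chi-square statistic divided by
# the median of the chi-square(1) distribution (about 0.455).
genomic_control_lambda <- function(p) {
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)
  median(chisq) / qchisq(0.5, df = 1)
}

set.seed(1)  # arbitrary seed
# Simulated inputs (NOT real data): calibrated p-values are uniform; inflated
# ones are mimicked by z-scores whose true variance is 1.86 instead of 1.
p_calibrated <- runif(100000)
p_inflated   <- 2 * pnorm(-abs(rnorm(100000, sd = sqrt(1.86))))

lambda_calibrated <- genomic_control_lambda(p_calibrated)
lambda_inflated   <- genomic_control_lambda(p_inflated)
print(c(calibrated = lambda_calibrated, inflated = lambda_inflated))
```

A value near 1 indicates well-calibrated statistics, while values well above 1, such as those of
the naive analyses, indicate inflation.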



We then apply our decoupling approach to account for the correlation structure. We
construct the decoupled studies and apply FE and RE. Figure 4B shows that the qq-plot is much
better calibrated ($\lambda_{FE} = 1.05$ and $\lambda_{RE} = 0.82$). Both the decoupling FE and RE
approaches identified as significant ($P < 1.2 \times 10^{-7}$, Bonferroni corrected for 397,450 tests)
the four loci that were reported in the combined analysis of the WTCCC study [16] (Table 2). This
shows that our approach was able to reproduce the previously reported results using only the
summary statistics.

Moreover, our decoupling approach has an advantage over the combined analysis [16]. Since
our approach can utilize both FE and RE, one can have good power to detect both the
homogeneous and the heterogeneous effects [8]. By contrast, the combined analysis can be thought
of as similar to FE and may not have good power to detect heterogeneous effects. In the Manhattan
plot (Figure 5A), the notable peaks are the PTPN22 gene on chromosome 1 and the major
histocompatibility complex (MHC) region on chromosome 6. Both loci are known to play
an important role in autoimmune diseases [3][4]. At both loci, the decoupling RE yields more
significant p-values than the decoupling FE (PTPN22: $P_{FE} = 5 \times 10^{-23}$ and
$P_{RE} = 1 \times 10^{-29}$, MHC: $P_{FE} = 5 \times 10^{-80}$ and $P_{RE} = 8 \times 10^{-181}$). This
reflects the heterogeneous nature of these two loci: they are strongly associated with RA and
T1D but only weakly with CD (Table 2). This shows that our decoupling approach allows one to
flexibly apply different meta-analytic approaches that are optimized for different situations.

Finally, we examine the robustness of our decoupling approach by adding the other four
diseases as noisy data into the meta-analysis. Figure 5B shows that when we meta-analyze all seven
diseases, the absolute magnitudes of the significance of the PTPN22 and MHC loci are reduced,
but they are still significant for both the decoupling FE and the decoupling RE. Moreover, the
relative significance gain of the decoupling RE compared to the decoupling FE is still
pronounced (PTPN22: $P_{FE} = 1 \times 10^{-11}$ and $P_{RE} = 9 \times 10^{-18}$, MHC:
$P_{FE} = 1 \times 10^{-28}$ and $P_{RE} = 1 \times 10^{-89}$).

3 Materials and Methods 

3.1 Meta-analytic methods 

We first briefly describe some of the existing meta-analytic methods. 

Fixed effects model 

The fixed effects model approach (FE) assumes that the magnitude of the effect size is fixed
across the studies [5][6]. The two widely used methods are the inverse-variance-weighted effect
size method [1] and the weighted sum of z-scores method [2]. Since the two methods are
approximately equivalent [18], we only describe the inverse-variance-weighted effect size method.
Let $X_1, \ldots, X_N$ be the effect size estimates in $N$ independent studies, such as the log odds
ratios or regression coefficients. Let $SE(X_i)$ be the standard error of $X_i$ and let
$V_i = SE(X_i)^2$. Let $W_i = V_i^{-1}$ be the inverse variance. The inverse-variance-weighted
summary effect size is

$$X_{FE} = \frac{\sum_i W_i X_i}{\sum_i W_i}. \tag{1}$$

The variance of $X_{FE}$ is

$$\mathrm{Var}(X_{FE}) = \frac{1}{\sum_i W_i}. \tag{2}$$

Since the standard error of $X_{FE}$ is $SE(X_{FE}) = \sqrt{1 / \sum_i W_i}$, we can construct a
summary z-score

$$Z_{FE} = \frac{X_{FE}}{SE(X_{FE})} = \frac{\sum_i W_i X_i}{\sqrt{\sum_i W_i}}$$

that follows $N(0, 1)$ under the null hypothesis of no association. The p-value is calculated as

$$p_{FE} = 2\Phi(-|Z_{FE}|)$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution.
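
The inverse-variance-weighted fixed effects computation above can be sketched in a few lines of R.
The function name `fixed_effects` and the three input studies are hypothetical and serve only to
illustrate the formulas.

```r
# Inverse-variance-weighted fixed effects meta-analysis.
# x:  effect size estimates (e.g., log odds ratios)
# se: their standard errors
fixed_effects <- function(x, se) {
  w    <- 1 / se^2              # inverse-variance weights W_i
  x_fe <- sum(w * x) / sum(w)   # summary effect size, equation (1)
  v_fe <- 1 / sum(w)            # variance of the summary effect size
  z_fe <- x_fe / sqrt(v_fe)     # summary z-score
  list(estimate = x_fe, se = sqrt(v_fe), z = z_fe,
       p = 2 * pnorm(-abs(z_fe)))
}

# Made-up effect sizes and standard errors for three independent studies.
res <- fixed_effects(x = c(0.20, 0.10, 0.30), se = c(0.10, 0.20, 0.10))
print(res)
```

For these made-up inputs the weights are 100, 25, and 100, so the summary estimate is
52.5/225 and the z-score is 3.5.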

Random effects model 

In the random effects (RE) model, it is assumed that the effect size varies among studies,
a phenomenon called heterogeneity. RE assumes that the effect size follows a probability
distribution with variance $\tau^2$. There are several approaches to estimate the variance
$\tau^2$, the most common one being the moment-based estimator of DerSimonian and Laird [7].
Given the estimate $\hat{\tau}^2$ of $\tau^2$, the summary effect size estimate is calculated as

$$X_{RE} = \frac{\sum_i (V_i + \hat{\tau}^2)^{-1} X_i}{\sum_i (V_i + \hat{\tau}^2)^{-1}}.$$

The standard error of $X_{RE}$ is $SE(X_{RE}) = \left( \sum_i (V_i + \hat{\tau}^2)^{-1} \right)^{-1/2}$.
In the traditional RE approach, one constructs a z-score statistic

$$Z_{RE} = \frac{X_{RE}}{SE(X_{RE})}$$

and the p-value is computed as $p_{RE} = 2\Phi(-|Z_{RE}|)$. Recently, Han and Eskin found that the
traditional RE is conservative and rarely achieves higher power than the fixed effects model [8].
This is because of the conservative null hypothesis of the traditional RE, which implicitly
assumes heterogeneity under the null. They proposed a new random effects model that corrects for
this problem, and the test statistic is the likelihood ratio statistic

$$S_{new} = \sum_i \log\left(\frac{V_i}{V_i + \hat{\tau}^2}\right) + \sum_i \frac{X_i^2}{V_i}
          - \sum_i \frac{(X_i - \hat{\mu})^2}{V_i + \hat{\tau}^2}$$

where $\hat{\mu}$ and $\hat{\tau}^2$ are the maximum likelihood estimates of the mean and variance of
the effect size, respectively, which can be found by an iterative procedure. The statistic follows
a half-and-half mixture of $\chi^2_{(0)}$ and $\chi^2_{(1)}$ under the null. The p-value can be
calculated by using the asymptotic distribution or by using a pre-constructed p-value table for a
more accurate calculation accounting for the small number of observations [8].
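
The traditional RE computation can be sketched in R as below. The sketch assumes the standard
moment-based form of the DerSimonian and Laird estimator, which is named but not written out in
the text, and it covers only the traditional RE test. The Han and Eskin statistic additionally
requires iterative maximum likelihood estimation and is not shown. The function name and inputs
are hypothetical.

```r
# Traditional random effects meta-analysis with the DerSimonian-Laird
# moment estimator of the between-study variance tau^2.
random_effects_dl <- function(x, se) {
  v    <- se^2
  w    <- 1 / v
  x_fe <- sum(w * x) / sum(w)
  q    <- sum(w * (x - x_fe)^2)   # Cochran's Q statistic
  k    <- length(x)
  # Moment estimator, truncated at zero.
  tau2 <- max(0, (q - (k - 1)) / (sum(w) - sum(w^2) / sum(w)))
  w_re  <- 1 / (v + tau2)         # random effects weights
  x_re  <- sum(w_re * x) / sum(w_re)
  se_re <- sqrt(1 / sum(w_re))
  z_re  <- x_re / se_re
  list(tau2 = tau2, estimate = x_re, se = se_re, z = z_re,
       p = 2 * pnorm(-abs(z_re)))
}

# Made-up inputs: a homogeneous and a heterogeneous set of three studies.
res_hom <- random_effects_dl(x = c(0.2, 0.2, 0.2),  se = rep(0.1, 3))
res_het <- random_effects_dl(x = c(-0.5, 0.0, 0.5), se = rep(0.1, 3))
print(c(tau2_hom = res_hom$tau2, tau2_het = res_het$tau2))
```

When the effect sizes agree, the estimate of $\tau^2$ is zero and the result coincides with FE.
When they disagree, the added variance widens the standard error of the summary estimate.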

P-value-based approaches 

There exist meta-analytic approaches combining p-values instead of effect sizes, the most
traditional one being Fisher's method [9]. Fisher's method combines the p-values
$p_1, \ldots, p_N$ by constructing a statistic

$$S_{Fisher} = -2 \sum_{i=1}^{N} \log(p_i)$$

which follows $\chi^2_{(2N)}$ under the null. The p-value-based approaches have the advantage that
they can be used even when one does not have information about the direction of effects. A
recently proposed p-value-based approach by Zaykin and Kozbur can take into account overlapping
subjects [10].
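
Fisher's method is simple enough to sketch directly in R. The function name `fisher_combine` and
the example p-values are hypothetical. The sketch assumes independent studies and therefore does
not account for overlapping subjects.

```r
# Fisher's method for combining independent p-values.
fisher_combine <- function(p) {
  s  <- -2 * sum(log(p))   # Fisher's statistic
  df <- 2 * length(p)      # chi-square with 2N degrees of freedom under the null
  list(statistic = s, df = df,
       p = pchisq(s, df = df, lower.tail = FALSE))
}

# Made-up p-values from three independent studies.
p_values <- c(0.01, 0.20, 0.03)
res <- fisher_combine(p_values)
print(res)
```

Combining a single p-value returns that p-value unchanged, which is a quick sanity check of the
degrees of freedom.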



Subset approaches 

Subset approaches have similarities to the fixed effects model, but the difference is that one
assumes that the effects can exist in only a subset of the studies. Han and Eskin compute
a statistic called the "m-value", a posterior probability that the effect is present in a study [12].
The m-values are incorporated in the weighted z-score FE approach to upweight studies with
high m-values. The p-value is assessed using importance sampling. Bhattacharjee et al. [11]
proposed an approach that computes FE statistics using all possible subsets of studies in a
meta-analysis and uses the maximum statistic. The method expedites the enumeration of all
$2^N - 1$ possible subsets of $N$ studies by using a novel statistical technique. This method
can account for overlapping subjects.



Bayesian approaches 



Morris proposed a Bayesian approach optimized for trans-ethnic meta-analysis [14]. This
method utilizes the Markov Chain Monte Carlo (MCMC) procedure to navigate through possible
disease models. In his MCMC, closely related populations have a higher chance of having similar
effect sizes, which increases the statistical power to detect heterogeneity caused by the
population spectrum. Wen [13] proposed a new method taking into account heterogeneity in the data
using the hierarchical model in the Bayesian framework. Wen [15] recently extended this method to
a multi-way table modeling, which can account for the correlation structure between studies or
the overlapping subjects.

3.2 Lin and Sullivan approach 

Lin and Sullivan proposed a systematic approach for dealing with overlapping subjects for the
fixed effects model [17]. The first step of their approach is to analytically calculate the
correlation of the statistics $X_1, \ldots, X_N$ caused by the overlapping subjects. Let $x$ be the
vector of effect size estimates $x = (X_1, \ldots, X_N)$ and let

$$C = [r_{ij}]_{N \times N}$$

be the correlation matrix of $x$, where $r_{ij}$ denotes the correlation between $X_i$ and $X_j$.
$r_{ij}$ is analytically approximated with the formula

$$r_{ij} \approx \left( n_{ij0} \sqrt{\frac{n_{i1} n_{j1}}{n_{i0} n_{j0}}}
               + n_{ij1} \sqrt{\frac{n_{i0} n_{j0}}{n_{i1} n_{j1}}} \right) \Big/ \sqrt{n_i n_j} \tag{4}$$

where $n_{i1}$, $n_{i0}$, and $n_i$ (or $n_{j1}$, $n_{j0}$, and $n_j$) are the number of cases, the
number of controls, and the total number of subjects in the $i$th (or $j$th) study, respectively,
and $n_{ij1}$ and $n_{ij0}$ are the numbers of cases and controls that overlap between the $i$th and
$j$th studies. See Bhattacharjee et al. [11] for an extended formula for the situation in which
some cases in one study are controls in other studies. Given the correlation matrix $C$, it is
straightforward to calculate the covariance matrix $\Omega$.
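
The correlation formula (4) can be sketched in R as follows. The function name and argument names
are hypothetical. The example counts (about 2,000 cases per disease and 2,938 shared controls)
are the approximate WTCCC numbers quoted in the caption of Figure 3, so the result is only a
rough check of the reported correlations.

```r
# Approximate correlation between the statistics of studies i and j caused by
# overlapping subjects (equation (4)).
# n_i1, n_i0: cases and controls in study i; n_j1, n_j0: same for study j;
# n_ij1, n_ij0: numbers of overlapping cases and overlapping controls.
lin_sullivan_r <- function(n_i1, n_i0, n_j1, n_j0, n_ij1 = 0, n_ij0 = 0) {
  n_i <- n_i1 + n_i0
  n_j <- n_j1 + n_j0
  (n_ij0 * sqrt((n_i1 * n_j1) / (n_i0 * n_j0)) +
   n_ij1 * sqrt((n_i0 * n_j0) / (n_i1 * n_j1))) / sqrt(n_i * n_j)
}

# Two diseases with about 2,000 cases each and 2,938 fully shared controls.
r_shared <- lin_sullivan_r(n_i1 = 2000, n_i0 = 2938,
                           n_j1 = 2000, n_j0 = 2938,
                           n_ij1 = 0, n_ij0 = 2938)
print(round(r_shared, 3))
```

With these approximate counts the formula gives a correlation of about 0.4. As sanity checks, the
formula returns 0 when no subjects overlap and 1 when two studies consist of exactly the same
cases and controls.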

The second step of the Lin and Sullivan approach is to optimally take into account $\Omega$ in the
testing. The optimal fixed effects model meta-analysis statistic is

$$X_{Lin} = \frac{e^T \Omega^{-1} x}{e^T \Omega^{-1} e} \tag{5}$$

where $e$ is the vector of ones ($e = (1, \ldots, 1)$). The formal proof for the optimality of this
statistic is shown in [21] and [22]. We also present a simple reasoning for deriving this statistic
in Supplementary Materials. The variance of the statistic is given by

$$\mathrm{Var}(X_{Lin}) = \frac{1}{e^T \Omega^{-1} e}. \tag{6}$$

The z-score $X_{Lin} / \sqrt{\mathrm{Var}(X_{Lin})}$ is calculated to obtain the p-value. Note that in
the special case that $X_1, \ldots, X_N$ are independent ($\Omega$ is a diagonal matrix), $X_{Lin}$
equals $X_{FE}$ and $\mathrm{Var}(X_{Lin})$ equals $\mathrm{Var}(X_{FE})$.
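
Equations (5) and (6) translate directly into matrix code. The R sketch below is a minimal
illustration with a hypothetical function name and made-up inputs. It also checks the special case
noted above, in which a diagonal covariance matrix reduces the statistic to the ordinary fixed
effects estimate.

```r
# Lin and Sullivan fixed effects statistic for correlated studies.
# x: effect size estimates; Omega: covariance matrix of x.
lin_sullivan_fe <- function(x, Omega) {
  e         <- rep(1, length(x))
  Omega_inv <- solve(Omega)
  denom     <- as.numeric(t(e) %*% Omega_inv %*% e)
  x_lin     <- as.numeric(t(e) %*% Omega_inv %*% x) / denom   # equation (5)
  v_lin     <- 1 / denom                                      # equation (6)
  z         <- x_lin / sqrt(v_lin)
  list(estimate = x_lin, variance = v_lin, z = z, p = 2 * pnorm(-abs(z)))
}

# Made-up effect sizes, standard errors, and a uniform correlation of 0.4.
x <- c(0.20, 0.10, 0.30)
s <- c(0.10, 0.20, 0.10)
C <- matrix(0.4, 3, 3)
diag(C) <- 1
Omega <- diag(s) %*% C %*% diag(s)

res_corr  <- lin_sullivan_fe(x, Omega)       # accounts for the correlation
res_indep <- lin_sullivan_fe(x, diag(s^2))   # independent studies: plain FE
print(c(var_corr = res_corr$variance, var_indep = res_indep$variance))
```

In this example the positive correlations increase the variance of the summary estimate relative
to the independent case, reflecting the information lost to the shared subjects.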

3.3 Decoupling approach 

We extend the Lin and Sullivan approach to a general framework that can be applied to a wide
range of meta-analytic methods. Our approach "uncorrelates" or "decouples" correlated studies
into independent studies whose standard errors are updated to account for the decoupling.
Suppose that we are given the effect sizes $x$, their standard errors $s$, and the correlation
matrix $C$ computed by equation (4). The decoupling procedure is the following.

1. Keep the original $x$.
   $x_{\mathrm{Decoupled}} \leftarrow x$

2. Compute the covariance matrix of $x$.
   $\Omega \leftarrow \mathrm{Diag}(s) \cdot C \cdot \mathrm{Diag}(s)$

3. Compute the decoupled covariance matrix.
   $\Omega_{\mathrm{Decoupled}} \leftarrow \mathrm{Diag}(e^T \Omega^{-1})^{-1}$

4. Update the standard errors.
   $s_{\mathrm{Decoupled}}[i] \leftarrow \sqrt{\Omega_{\mathrm{Decoupled}}[i, i]}$ for each $i = 1, \ldots, N$

5. Use $x_{\mathrm{Decoupled}}$ and $s_{\mathrm{Decoupled}}$ in the downstream meta-analysis.

$\mathrm{Diag}(s)$ denotes a diagonal matrix whose diagonal entries are $s$. The brackets [ ] denote
the index of an element of a vector or a matrix. In Supplementary Materials, we present a simple
R code performing this procedure.

We give a simple working example of this procedure. Suppose that we have two effect sizes
$X_1$ and $X_2$. For simplicity, let their variances be 1.0. Under the fixed effects model, the best
summary estimate of the effect size will be $(X_1 + X_2)/2$ and its variance will be 1.0/2 = 0.5,
which will be the correct variance if the two studies are independent ($r_{12} = 0$). Now consider
the case that $X_1$ and $X_2$ are highly correlated ($r_{12} = 0.99$). Intuitively, since they are
highly correlated, the information they contain is not much better than the information a single
study contains. The decoupling formula gives us the new variance of each study, increased to 1.99.
When we plug the new variances into the fixed effects model, the variance of the summary effect
size will be $1.99/2 = 0.995 \approx 1.0$, showing that, as expected, the uncertainty in the final
estimate is approximately the same as the uncertainty that we would obtain with a single study.
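
The arithmetic of this two-study example can be reproduced with a few lines of R. This is only a
sketch of the example above, with hypothetical variable names. For two unit-variance studies the
row sums of $\Omega^{-1}$ equal $1/(1 + r_{12})$, so each decoupled variance is $1 + r_{12}$.

```r
# Two unit-variance studies with correlation r12 = 0.99.
r12   <- 0.99
Omega <- matrix(c(1,   r12,
                  r12, 1), nrow = 2, byrow = TRUE)

# Decoupling: new variances are the reciprocals of the row sums of Omega^{-1}.
v_decoupled <- 1 / rowSums(solve(Omega))

# Variance of the fixed effects summary estimate, 1 / sum(1 / v).
v_fe_naive     <- 1 / sum(1 / c(1, 1))       # ignores the correlation
v_fe_decoupled <- 1 / sum(1 / v_decoupled)   # uses the decoupled variances

print(c(decoupled_variance = v_decoupled[1],
        fe_naive = v_fe_naive, fe_decoupled = v_fe_decoupled))
```

The decoupled per-study variance is 1.99 and the resulting summary variance is 0.995, compared
with 0.5 when the correlation is ignored.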

Observation 1. Using the decoupled studies in the fixed effects model is equivalent to the Lin 
and Sullivan approach. 

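Proof. Given a covariance matrix $\Omega$, our decoupling approach calculates the updated
standard errors $s_{\mathrm{Decoupled}}$ by calculating $\Omega_{\mathrm{Decoupled}}$. Since
$\Omega_{\mathrm{Decoupled}}$ is a diagonal matrix, the following relationship holds:

$$\Omega_{\mathrm{Decoupled}} = \mathrm{Diag}(s_{\mathrm{Decoupled}}) \cdot \mathrm{Diag}(s_{\mathrm{Decoupled}})$$

On the other hand, given an effect size vector $x$ and standard errors $s$, the standard fixed
effects model formulae in equations (1) and (2) can be written

$$X_{FE} = \frac{e^T V^{-1} x}{e^T V^{-1} e}$$

and

$$V_{FE} = \frac{1}{e^T V^{-1} e}$$

where $V = \mathrm{Diag}(s) \cdot \mathrm{Diag}(s)$ and $V_{FE}$ denotes the variance of $X_{FE}$. If we
plug $s_{\mathrm{Decoupled}}$ into this formula,

$$\begin{aligned}
X_{FE} &= \frac{e^T V^{-1} x}{e^T V^{-1} e} \\
       &= \frac{e^T (\mathrm{Diag}(s_{\mathrm{Decoupled}}) \cdot \mathrm{Diag}(s_{\mathrm{Decoupled}}))^{-1} x}
               {e^T (\mathrm{Diag}(s_{\mathrm{Decoupled}}) \cdot \mathrm{Diag}(s_{\mathrm{Decoupled}}))^{-1} e} \\
       &= \frac{e^T \Omega_{\mathrm{Decoupled}}^{-1} x}{e^T \Omega_{\mathrm{Decoupled}}^{-1} e} \\
       &= \frac{e^T \mathrm{Diag}(e^T \Omega^{-1}) x}{e^T \mathrm{Diag}(e^T \Omega^{-1}) e} \\
       &= \frac{e^T \Omega^{-1} x}{e^T \Omega^{-1} e} \\
       &= X_{Lin}
\end{aligned}$$

where $X_{Lin}$ is in equation (5). The second-to-last equality holds because
$e^T \mathrm{Diag}(e^T \Omega^{-1}) = e^T \Omega^{-1}$. Similarly, we can show that $V_{FE}$ equals
$\mathrm{Var}(X_{Lin})$ in equation (6). Therefore, applying our decoupling approach to the fixed
effects model is equivalent to the Lin and Sullivan approach [17].

4 Discussions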

We proposed a general framework for dealing with overlapping subjects in a meta-analysis. The 
core idea is to "decouple" the correlated studies into independent studies and use them in the 
downstream meta-analysis. Our approach can flexibly allow many meta-analytic methods, such 
as the random effects model, to account for overlapping subjects. The simulations and the 
applications to the WTCCC data support the utilities of our approach. 

Since our approach involves only a data-side change rather than a method-side change,
one advantage is that one can continue to use preferred software in the existing analysis pipeline
which may not support overlapping subjects. This can be important in practice, since it can
be inconvenient to change a pipeline. For example, one may have been using METAL [18] or
MANTEL [2] for automatically detecting strand inconsistencies between studies and for
automatically applying the genomic control [19]. Given new data with overlapping subjects, one
does not need to switch to different software supporting overlapping subjects but can simply
update the standard errors using our approach and continue to use the existing pipeline.

In this paper, we primarily focused on dealing with overlapping subjects, but our decoupling
approach can be applied to any context of meta-analysis where the inputs are correlated.
For example, in an eQTL study, multiple tissues can be analyzed together in a meta-analysis
framework [23]. Since tissues of the same individual are correlated, this results in a meta-analytic
problem where the inputs are correlated. In such cases, our decoupling approach with FE and
RE methods can be applied to detect both tissue-specific and shared eQTLs [23].

The limitation of our approach is that the optimality is guaranteed only under the fixed 
effects model. For example, our simulations show that our approach has some conservative 
tendencies under the random effects model, indicating that our approach may not be optimal, 
although we showed that it works well in the simulations and the WTCCC data. Optimal 
solutions to account for correlated inputs for each different meta-analytic method will be an 
interesting topic for further research. 

We note that one should be careful in interpreting data based on the decoupled studies. For
example, heterogeneity testing is highly conservative when using decoupled studies and may
not detect true heterogeneity well. Unfortunately, there is not yet a good alternative to
Cochran's Q test and the $I^2$ estimate that can take into account correlated inputs. Developing
such methods will also be an interesting area for future research.

Category                   Method                                Supports overlapping subjects
Fixed effects model        Traditional FE [5][6]                 No
                           Lin and Sullivan [17]                 Yes
Random effects model       Traditional RE [7]                    No
                           Han and Eskin RE [8]                  No
P-value-based approaches   Fisher's approach [9]                 No
                           Zaykin and Kozbur [10]                Yes
Subset approaches          Bhattacharjee et al. [11]             Yes
                           Binary Effects Model [12]             No
Bayesian approaches        Hierarchical Bayesian approach [13]   No
                           Trans-ethnic approach [14]            No
                           Multi-way table approach [15]         Yes

Table 1: Different meta-analytic methods and their support for overlapping subjects.

Figure 1: A simple example of our decoupling approach. $\Omega$ and $\Omega_{\mathrm{Decoupled}}$ are
the covariance matrices of the statistics of three studies A, B, and C before and after decoupling,
respectively. The thickness of the edges denotes the amount of correlation between the studies.
After decoupling, the size of the nodes reflects the information that the studies contain in terms
of the inverse variance.

Figure 2: The false positive rate and power of different methods under the fixed and the random
effects model. (A, B) We simulate studies assuming the null model of no associations. (C) We
simulate studies assuming the alternative model based on the fixed effects model where the effect
sizes are fixed across studies. We vary the magnitude of the fixed effect size. (D) We simulate
studies assuming the alternative model based on the random effects model where the effect sizes
vary across studies with an additional variance $\tau^2$. We vary both the magnitude of the mean
effect size and the heterogeneity $\tau^2$.

Figure 3: The correlation structure of the statistics of seven diseases in the WTCCC data. All
seven diseases have approximately 2,000 cases and share 2,938 controls.

Figure 4: The qq-plots of a meta-analysis combining CD, RA, and T1D of the WTCCC data.
The MHC region is excluded.

Figure 5: The Manhattan plots of meta-analyses combining a set of diseases of the WTCCC
data. (A) We combine CD, RA, and T1D. (B) We combine all seven diseases.

6 Supplementary Materials

6.1 Optimality of Lin and Sullivan approach 

We present a simple reasoning to derive and show the optimality of the Lin and Sullivan statistic
in equation (5). Note that the formal proof of the optimality is shown in [22] and [21], and there
can be many other proofs. Equation (5) is

$$X_{Lin} = \frac{e^T \Omega^{-1} x}{e^T \Omega^{-1} e}$$

where the notations are described in Methods. One simple way to derive this statistic is to
translate the problem of finding a summary statistic into a linear regression framework. Consider
$x$ as the dependent variable whose covariance is $\Omega$. Then finding the best summary statistic
is equivalent to finding the best mean, or intercept, $\beta$. Thus, we have a regression model
including only the intercept term

$$x = \beta e + \epsilon$$

where $\mathrm{Var}(\epsilon) = \Omega$. By the standard generalized least squares formula, the
optimal estimate of $\beta$ will be

$$\hat{\beta} = (e^T \Omega^{-1} e)^{-1} e^T \Omega^{-1} x.$$

Since $e^T \Omega^{-1} e$ is a scalar, this form is equivalent to $X_{Lin}$.
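
The regression view can be checked numerically. The R sketch below compares the generalized least
squares formula with ordinary least squares on "whitened" data, which is one standard way to
obtain a generalized least squares estimate. The covariance matrix and effect sizes are
arbitrary random values generated only for this check, and all variable names are hypothetical.

```r
set.seed(1)  # arbitrary seed
n <- 5

# Arbitrary symmetric positive definite covariance matrix and effect sizes.
A     <- matrix(rnorm(n * n), n, n)
Omega <- crossprod(A) + diag(n)
x     <- rnorm(n)
e     <- rep(1, n)

# Generalized least squares estimate of the intercept (the X_Lin formula).
Omega_inv <- solve(Omega)
beta_gls  <- as.numeric(solve(t(e) %*% Omega_inv %*% e) %*%
                        t(e) %*% Omega_inv %*% x)

# Whitening: with Omega = R^T R (Cholesky factor R), transform x and e by
# R^{-T}, then run ordinary least squares with a single regressor.
R   <- chol(Omega)
x_w <- backsolve(R, x, transpose = TRUE)   # solves R^T y = x
e_w <- backsolve(R, e, transpose = TRUE)   # solves R^T y = e
beta_ols_whitened <- sum(e_w * x_w) / sum(e_w^2)

print(c(gls = beta_gls, ols_whitened = beta_ols_whitened))
```

The two estimates agree up to floating-point error, consistent with the derivation above.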

6.2 R code performing decoupling approach 

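## Decoupling approach.
## Input:
##   s is the standard errors (possibly with NA)
##   C is the correlation matrix
##   (or one can specify C.inv (inverse of C) for speed-up)
## Output:
##   updated standard errors after decoupling
decoupling <- function(s, C, C.inv=NULL) {
    i <- !is.na(s)                        # studies with a non-missing standard error
    if (is.null(C.inv)) {
        ## Omega = Diag(s) C Diag(s), then invert.
        ## nrow guards against diag() misreading a single value as a dimension.
        D <- diag(s[i], nrow=sum(i))
        Omega.inv <- solve(D %*% C[i, i, drop=FALSE] %*% D)
    } else {
        ## Omega^{-1} = Diag(1/s) C^{-1} Diag(1/s).
        ## In general, C.inv[i, i] is the inverse of C[i, i] only when no s is NA.
        D.inv <- diag(1/s[i], nrow=sum(i))
        Omega.inv <- D.inv %*% C.inv[i, i, drop=FALSE] %*% D.inv
    }
    s.new <- sqrt(1/rowSums(Omega.inv))   # sqrt of the diagonal of Omega_Decoupled
    s[i] <- s.new
    s
}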



